Notebook by: Syaiful Andy

Task

A simple yet powerful marketing technique is an analysis that uses recency (how recent was the customer's last purchase), frequency (how often did the customer make a purchase in a given period), and monetary (how much money did the customer spend in a given period) data to identify the best customers and run targeted marketing campaigns.

As a data scientist, you are asked to segment the customers using transaction data and profile them based on their characteristics (recency, frequency, monetary). After you find the segments, give them understandable names so the marketing team can easily create campaign strategies.

Data: ../data/transactions.csv

Hints: For each customer ID, compute the time difference between their last transaction and today. You should also calculate the number of transactions and the total amount spent. You are allowed to use SQL.

Output: Push the executed notebook to your GitHub repo and submit the URL to the class leader (ketua kelas) no later than August 21, 2021. Note that the notebook must contain explanatory analysis and clustering, as well as a story about your findings.

Good luck!

Load Data & Data Cleansing

The output shows invalid data in the trans_date column: 29-Feb-17 is not a real date, because 2017 was not a leap year. We therefore drop the rows that contain trans_date 29-Feb-17.
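One way to find and drop such rows is to let pandas mark unparseable dates as missing. The snippet below is a minimal sketch on toy rows, not the notebook's actual code; the `customer_id` column name and the `%d-%b-%y` date format are assumptions inferred from the value 29-Feb-17.

```python
import pandas as pd

# Toy rows standing in for ../data/transactions.csv.
# customer_id is an assumed column name; trans_date and trans_amount come from the text.
df = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3"],
    "trans_date": ["11-Feb-13", "29-Feb-17", "15-Mar-15"],
    "trans_amount": [35, 50, 72],
})

# errors="coerce" turns dates that cannot exist (such as 29-Feb-17) into NaT
df["trans_date"] = pd.to_datetime(df["trans_date"], format="%d-%b-%y", errors="coerce")

# Drop the rows whose date could not be parsed
clean = df.dropna(subset=["trans_date"]).reset_index(drop=True)
print(clean)
```

Coercing instead of filtering on the literal string also catches any other impossible dates, at the cost of hiding them unless the NaT rows are inspected first.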

Recency (how recent was the customer's last purchase)

Frequency (how often did the customer make a purchase in a given period)

Monetary (how much money did the customer spend in a given period)

Join 3 Features (Recency, Frequency, Monetary)
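As an illustration, the three features can also be computed in a single grouped aggregation instead of three separate tables that are joined afterwards. This is a sketch on toy data under assumptions: the `customer_id` column name, the output column names, and the reference date `today` are all hypothetical.

```python
import pandas as pd

# Toy transactions; customer_id is an assumed column name.
tx = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2", "C2", "C2", "C3"],
    "trans_date": pd.to_datetime([
        "2015-01-10", "2015-03-01",
        "2014-06-05", "2014-12-20", "2015-02-15",
        "2013-09-30",
    ]),
    "trans_amount": [40, 60, 25, 35, 80, 55],
})

# Assumed reference date; the task asks for the difference to "today".
today = pd.Timestamp("2015-03-16")

rfm = tx.groupby("customer_id").agg(
    last_purchase=("trans_date", "max"),   # most recent transaction
    frequency=("trans_date", "count"),     # number of transactions
    monetary=("trans_amount", "sum"),      # total amount spent
)

# Recency in days between the last purchase and the reference date
rfm["recency"] = (today - rfm["last_purchase"]).dt.days
rfm = rfm[["recency", "frequency", "monetary"]]
print(rfm)
```

A fixed reference date keeps the recency values reproducible when the notebook is re-run later.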

Data Exploration

Data Visualization

From the plot we can see a correlation between the frequency and trans_amount variables.

Standardized Variables
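Standardization matters here because K-Means uses Euclidean distance, so a variable measured in large units would otherwise dominate. A minimal sketch, assuming scikit-learn's `StandardScaler` is the tool used (the source does not say) and using made-up recency, frequency, and monetary values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows: columns are recency (days), frequency (count), monetary (amount)
rfm_values = np.array([
    [15, 2, 100],
    [29, 3, 140],
    [532, 1, 55],
    [120, 5, 310],
], dtype=float)

# Rescale every column to mean 0 and standard deviation 1
scaler = StandardScaler()
rfm_std = scaler.fit_transform(rfm_values)

print(rfm_std.mean(axis=0))  # close to 0 for every column
print(rfm_std.std(axis=0))   # close to 1 for every column
```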

K-Means: with 2 variables with speculated k

Let's segment the data based on 2 variables: recency and frequency.

k_means.cluster_centers_ returns the centroids in standardized units. We can easily recompute the centroids in the original units from the original data.
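Two equivalent ways to get centroids in original units are sketched below: undo the scaling with `inverse_transform`, or average the original rows per cluster label. The data is made up, and `n_init` and `random_state` are assumed settings, not the notebook's.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up rows: columns are recency (days) and frequency (count)
X = np.array([
    [5, 20], [8, 22], [6, 18],
    [200, 4], [220, 5], [210, 3],
    [30, 8], [35, 9], [28, 7],
], dtype=float)

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

k_means = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_std)

# Option 1: map the standardized centroids back to original units
centroids_original = scaler.inverse_transform(k_means.cluster_centers_)

# Option 2: average the original rows of each cluster
manual = np.array([X[k_means.labels_ == c].mean(axis=0) for c in range(3)])

print(centroids_original)
```

Both options agree because standardization is a linear rescaling per column, so the mean of the scaled points maps back to the mean of the original points.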

Conclusion for 3-cluster K-Means with 2 variables:

  1. Green in the image: appears to be loyal customers, with recent purchases and a high purchase frequency.
  2. Red in the image: appears to be low/medium-loyalty customers, with recent purchases but a low/medium purchase frequency.
  3. Blue in the image: appears to be low-loyalty or non-loyal customers, who have not purchased for a medium/long time and have a low/medium purchase frequency.

K-Means (2): 2 Variables with Elbow Method and Silhouette to Determine k

The elbow method suggests k = 4, but the silhouette method suggests k = 3. We can choose k = 4 to make the clustering more specific.
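Both criteria can be collected in one loop over candidate values of k. The sketch below uses synthetic blobs as a stand-in for the standardized data, so its numbers say nothing about the real dataset; the range of k and the K-Means settings are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the standardized recency/frequency data
X_std, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

inertia = {}     # within-cluster sum of squares, plotted for the elbow method
silhouette = {}  # mean silhouette coefficient, higher is better

for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_std)
    inertia[k] = model.inertia_
    silhouette[k] = silhouette_score(X_std, model.labels_)

# The silhouette criterion picks the k with the highest score
best_k_silhouette = max(silhouette, key=silhouette.get)

for k in inertia:
    print(k, round(inertia[k], 1), round(silhouette[k], 3))
```

The elbow is read off visually from the inertia curve, which is why it can disagree with the silhouette maximum, as it does in this notebook.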

Conclusion for 4-cluster K-Means with 2 variables:

  1. Red in the image: appears to be loyal customers, with recent purchases and a high purchase frequency.
  2. Green in the image: appears to be low/medium-loyalty customers, with recent purchases but a low/medium purchase frequency.
  3. Purple in the image: appears to be new customers, because they have purchased recently but have a low/medium purchase frequency.
  4. Blue in the image: appears to be non-loyal customers, because they have not purchased for a long time and have a low/medium purchase frequency.

K-Means (3): 3 Variables with Elbow Method and Silhouette to Determine k

The elbow method suggests k = 4, but the silhouette method suggests k = 2. We will use k = 4 to make the clustering more specific.

Conclusion for 4-cluster K-Means with 3 variables:

  1. Label 1 (red dots in the 3D plot): appears to be very loyal customers, with recent purchases, a high purchase frequency, and a high transaction amount.
  2. Label 3 (blue dots in the 3D plot): appears to be medium-loyalty customers, with recent purchases, a medium purchase frequency, and a medium transaction amount.
  3. Label 0 (purple dots in the 3D plot): appears to be new customers, with recent purchases but a low purchase frequency and a low transaction amount.
  4. Label 2 (green dots in the 3D plot): appears to be non-loyal customers, because they have not purchased for a long time and have a low purchase frequency and a low transaction amount.

Hierarchical Clustering : Agglomerative

Conclusion for 4-cluster hierarchical agglomerative clustering with 3 variables:

  1. Label 2 (green dots in the 3D plot): appears to be very loyal customers, with recent purchases, a high purchase frequency, and a high transaction amount.
  2. Label 0 (blue dots in the 3D plot): appears to be low/medium-loyalty customers, with recent purchases, a low/medium purchase frequency, and a low/medium transaction amount.
  3. Label 1 (red dots in the 3D plot): appears to be low-loyalty customers, who have not purchased for a medium to long time and have a low/medium purchase frequency and a low transaction amount.
  4. Label 3 (purple dots in the 3D plot): appears to be non-loyal customers, because they have not purchased for a long time and have a low purchase frequency and a low transaction amount.
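For reference, a 4-cluster agglomerative fit can be sketched as follows. The data is synthetic, and Ward linkage is an assumption, since the linkage actually used is not stated here.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic stand-in for the standardized recency/frequency/monetary data
X_std, _ = make_blobs(n_samples=200, centers=4, n_features=3, random_state=0)

# Ward linkage (assumed) merges the pair of clusters that least increases variance
model = AgglomerativeClustering(n_clusters=4, linkage="ward")
labels = model.fit_predict(X_std)

# Number of customers per cluster label
sizes = np.bincount(labels)
print(sizes)
```

Checking `sizes` right after fitting is a quick way to spot very unbalanced clusters before interpreting them.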

Density-based Clustering: DBSCAN

Conclusion for 4-cluster DBSCAN with 3 variables:

When we tune DBSCAN to produce 4 clusters, a large number of points end up labeled as outliers.
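DBSCAN does not take a number of clusters as input; the count follows from `eps` and `min_samples`, and points that fit no cluster get the label -1. A minimal sketch on synthetic data, with assumed parameter values, showing how to count clusters and outliers:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Synthetic data: one dense group plus widely scattered points
dense = rng.normal(0, 0.1, size=(50, 3))
scattered = rng.uniform(-5, 5, size=(20, 3))
X = np.vstack([dense, scattered])

# eps and min_samples are assumed values chosen for this toy data
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

n_noise = int((labels == -1).sum())      # points labeled as outliers
n_clusters = len(set(labels) - {-1})     # clusters, excluding the noise label

print(n_clusters, n_noise)
```

In practice, reaching a target such as 4 clusters means searching over `eps` and `min_samples`, and the outlier count should be reported alongside the cluster sizes.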

Comparing K-Means, Hierarchical, and DBSCAN with 4 clusters

Conclusion:

K-Means appears to work better than hierarchical clustering and DBSCAN on this dataset.

  1. In hierarchical clustering, one cluster contains a very large number of customers (6290) and another contains very few (10).
  2. In DBSCAN, one cluster contains a very large number of customers (5370), and there is a very large number of outliers (1484).
  3. In K-Means, the 4 clusters are closer in size (1765, 1738, 641, and 2745 customers).
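The size comparison above can be reproduced for any set of label arrays with a small helper. The function names and toy label arrays below are hypothetical and only illustrate the idea.

```python
import numpy as np

def cluster_sizes(labels):
    """Return {label: count}; DBSCAN's noise label -1 is kept as its own entry."""
    values, counts = np.unique(labels, return_counts=True)
    return {int(v): int(c) for v, c in zip(values, counts)}

def largest_share(labels):
    """Fraction of all points that fall in the largest group."""
    sizes = cluster_sizes(labels)
    return max(sizes.values()) / sum(sizes.values())

# Toy label arrays: one balanced result, one dominated by a single cluster
balanced = np.array([0, 0, 1, 1, 2, 2, 3, 3])
skewed = np.array([0, 0, 0, 0, 0, 0, 1, -1])

print(cluster_sizes(balanced), largest_share(balanced))
print(cluster_sizes(skewed), largest_share(skewed))
```

Balanced sizes are only one criterion; a segmentation is useful to the marketing team when each segment is also large enough to target and distinct enough to describe.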